Extension of Zipf's Law to Word and Character N-grams for English and Chinese

نویسندگان

  • Le Quan Ha
  • Elvira I. Sicilia-Garcia
  • Ji Ming
  • Francis Jack Smith
چکیده

It is shown that for a large corpus, Zipf 's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or characters in one list and put in order of frequency, the frequency of tokens in the combined list follows Zipf’s law approximately with the slope close to -1 on a loglog plot for all n-grams, down to the lowest frequencies in both languages. This behaviour is also found for English 2-byte and 3-byte word fragments. It only happens when all n-grams are used, including semantically incomplete n-grams. Previous theories do not predict this behaviour, possibly because conditional probabilities of tokens have not been properly represented.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reduced n-gram Models for English and Chinese Corpora

Statistical language models should improve as the size of the n-grams increases from 3 to 5 or higher. However, the number of parameters and calculations, and the storage requirement increase very rapidly if we attempt to store all possible combinations of n-grams. To avoid these problems, the reduced n-grams’ approach previously developed by O’Boyle (1993) can be applied. A reduced n-gram lang...

متن کامل

Characteristics of character usage in Chinese Web searching

The use of non-English Web search engines has been prevalent. Given the popularity of Chinese Web searching and the unique characteristics of Chinese language, it is imperative to conduct studies with focuses on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the d...

متن کامل

Extension of Zipf's Law to Words and Phrases

Zipf’s law states that the frequency of word tokens in a large corpus of natural language is inversely proportional to the rank. The law is investigated for two languages English and Mandarin and for ngram word phrases as well as for single words. The law for single words is shown to be valid only for high frequency words. However, when single word and n-gram phrases are combined together in on...

متن کامل

BLEU in Characters: Towards Automatic MT Evaluation in Languages without Word Delimiters

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU or NIST, are now well established. Yet, they are scarcely used for the assessment of language pairs like English-Chinese or English-Japanese, because of the word segmentation problem. This study establishes the equivalence between the standard use of BLEU in word n-grams and its application at the character level. T...

متن کامل

Character Distributions of Classical Chinese Literary Texts: Zipf's Law, Genres, and Epochs

We collect 14 representative corpora for major periods in Chinese history in this study. These corpora include poetic works produced in several dynasties, novels of the Ming and Qing dynasties, and essays and news reports written in modern Chinese. The time span of these corpora ranges between 1046 BCE and 2007 CE. We analyze their character and word distributions from the viewpoint of the Zipf...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • IJCLCLP

دوره 8  شماره 

صفحات  -

تاریخ انتشار 2003